{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker\n",
    "\n",
    "This tutorial will show how easy it is to deploy Zephyr 7B on AWS Infernetia2 using Amazon SageMaker. Zephyr is a 7B parameter LLM fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). More details are in the [technical report](https://arxiv.org/abs/2310.16944). The model is released under the Apache 2.0 license, ensuring wide accessibility and use. We are going to show you how to:\n",
    "\n",
    "1. Setup development environment\n",
    "2. Retrieve the TGI Neuronx Image\n",
    "3. Deploy Zephyr 7B to Amazon SageMaker\n",
    "4. Run inference and chat with the model\n",
    "\n",
    "Let’s get started.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup development environment\n",
    "\n",
    "We are going to use the `sagemaker` python SDK to deploy Mixtral to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/bin/pip:6: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html\n",
      "  from pkg_resources import load_entry_point\n",
      "\u001b[31mERROR: sagemaker 2.206.0 has requirement PyYAML~=6.0, but you'll have pyyaml 5.3.1 which is incompatible.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install transformers \"sagemaker>=2.206.0\" --upgrade --quiet"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml\n",
      "sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role\n",
      "sagemaker session region: us-east-1\n"
     ]
    }
   ],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "sess = sagemaker.Session()\n",
    "# sagemaker session bucket -> used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket=None\n",
    "if sagemaker_session_bucket is None and sess is not None:\n",
    "    # set to default bucket if a bucket name is not given\n",
    "    sagemaker_session_bucket = sess.default_bucket()\n",
    "\n",
    "try:\n",
    "    role = sagemaker.get_execution_role()\n",
    "except ValueError:\n",
    "    iam = boto3.client('iam')\n",
    "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
    "\n",
    "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
    "\n",
    "print(f\"sagemaker role arn: {role}\")\n",
    "print(f\"sagemaker session region: {sess.boto_region_name}\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Retrieve TGI Neuronx Image\n",
    "\n",
    "The new Hugging Face TGI Neuronx DLC can be used to run inference on AWS Inferentia2. To retrieve the URI for the desired Hugging Face TGI Neuronx DLC we can use the `get_huggingface_tgi_neuronx_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face TGI Neuronx DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+neuronx&expanded=true)\n",
    "\n",
    "_Note: At the time of writing this blog post the latest version of the Hugging Face LLM DLC is not yet available via the `get_huggingface_llm_image_uri` method. We are going to use the raw container uri instead._\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04\n"
     ]
    }
   ],
   "source": [
    "from sagemaker.huggingface import get_huggingface_llm_image_uri\n",
    "\n",
    "# retrieve the llm image uri\n",
    "llm_image = get_huggingface_llm_image_uri(\n",
    "  \"huggingface-neuronx\",\n",
    "  version=\"0.0.17\"\n",
    ")\n",
    "\n",
    "# print ecr image uri\n",
    "print(f\"llm image uri: {llm_image}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Deploy Zephyr 7B to Amazon SageMaker\n",
    "\n",
    "Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can check the full list of supported models (text-generation) [here](https://huggingface.co/docs/optimum-neuron/package_reference/export#supported-architectures). \n",
    "In this example, we will deploy [Hugging Face Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) to Amazon SageMaker. Zephyr is a 7B parameter LLM fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). You can find more details in the [technical report](https://arxiv.org/abs/2310.16944).\n",
    "\n",
    "### Compiling LLMs for Inferenetia2\n",
    "\n",
    "At the time of writing, [AWS Inferentia2 does not support dynamic shapes for inference](https://awsdocs-neuron.readthedocs-hosted.com/en/v2.6.0/general/arch/neuron-features/dynamic-shapes.html#neuron-dynamic-shapes), which means that we need to specify our sequence length and batch size in advanced. \n",
    "To make it easier for customers to utilize the full power of Inferentia2, we created a [neuron model cache](https://huggingface.co/docs/optimum-neuron/guides/cache_system), which contains pre-compiled configurations for the most popular LLMs. A cached configuration is defined through a model architecture (Mistral), model size (7B), neuron version (2.16), number of inferentia cores (2), batch size (2), and sequence length (2048).  \n",
    "This means compiling fine-tuned checkpoints for Mistral 7B with the same configuration will take only a few minutes. Examples of this are [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).\n",
    "\n",
    "_**Note:** Currently, TGI can only load compiled checkpoints and models. We are working on an on-the-fly compilation based on the cache. This means that you can pass any model ID from the Hugging face Hub, e.g., `HuggingFaceH4/zephyr-7b-beta` if there is a cached configuration. This should be added in the next release. We update the blog here once released._\n",
    "\n",
    "For the blog we compiled `HuggingFaceH4/zephyr-7b-beta` using the following command and parameters on a `inf2.8xlarge` instance and pushed it to the hub at []():\n",
    "\n",
    "```bash\n",
    "# compile model with optimum for batch size 4 and sequence length 2048\n",
    "optimum-cli export neuron -m HuggingFaceH4/zephyr-7b-beta --batch_size 4 --sequence_length 2048 --num_cores 2 --auto_cast_type bf16 ./zephyr-7b-beta-neuron\n",
    "# push model to hub [repo_id] [local_path] [path_in_repo]\n",
    "huggingface-cli upload  aws-neuron/zephyr-7b-seqlen-2048-bs-4 ./zephyr-7b-beta-neuron ./ --exclude \"checkpoint/**\"\n",
    "# Move tokenizer to neuron model repository\n",
    "python -c \"from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta').push_to_hub('aws-neuron/zephyr-7b-seqlen-2048-bs-4')\"\n",
    "```\n",
    "\n",
    "If you are trying to compile an LLM with a configuration that is not yet cached, it can take up to 45 minutes.\n",
    "\n",
    "### Deploying TGI Neuronx Endpoint\n",
    "\n",
    "Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. Due to the current boundaries of Inferentia2, we need to make sure that the following parameters are set to the same value:\n",
    "* `MAX_CONCURRENT_REQUESTS`: Equals to the batch size, which was used to compile the model.\n",
    "* `MAX_INPUT_LENGTH`: Equals or lower than the sequence length, which was used to compile the model.\n",
    "* `MAX_TOTAL_TOKENS`: Equals to the sequence length, which was used to compile the model.\n",
    "* `MAX_BATCH_PREFILL_TOKENS`: half of the max tokens [batch_size * sequence_length] / 2\n",
    "* `MAX_BATCH_TOTAL_TOKENS`: Equals to the max tokens [batch_size * sequence_length]\n",
    "\n",
    "In addition, we need to set the `HF_MODEL_ID` pointing to the Hugging Face model ID. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from sagemaker.huggingface import HuggingFaceModel\n",
    "\n",
    "# sagemaker config & model config\n",
    "instance_type = \"ml.inf2.8xlarge\"\n",
    "health_check_timeout = 900\n",
    "batch_size = 4\n",
    "sequence_length = 2048\n",
    "\n",
    "# Define Model and Endpoint configuration parameter\n",
    "config = {\n",
    "  'HF_MODEL_ID': \"aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2\", \n",
    "  'MAX_CONCURRENT_REQUESTS': json.dumps(batch_size), \n",
    "  'MAX_INPUT_LENGTH': json.dumps(1512),\n",
    "  'MAX_TOTAL_TOKENS': json.dumps(sequence_length),\n",
    "  'MAX_BATCH_PREFILL_TOKENS': json.dumps(int(sequence_length*batch_size / 2)), \n",
    "  'MAX_BATCH_TOTAL_TOKENS': json.dumps(sequence_length*batch_size),  \n",
    "}\n",
    "\n",
    "# create HuggingFaceModel with the image uri\n",
    "llm_model = HuggingFaceModel(\n",
    "  role=role,\n",
    "  image_uri=llm_image,\n",
    "  env=config\n",
    ")"
   ]
  },
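  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values above have to match the configuration the checkpoint was compiled with. If you want to double-check them, exported Neuron checkpoints store their compilation settings in `config.json` under a `neuron` key. The following is a minimal sketch to inspect it (field names can vary with the `optimum-neuron` version used for the export):\n",
    "\n",
    "```python\n",
    "import json\n",
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "# download the compiled checkpoint's config.json from the Hugging Face Hub\n",
    "config_path = hf_hub_download(\"aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2\", \"config.json\")\n",
    "\n",
    "with open(config_path) as f:\n",
    "    neuron_config = json.load(f).get(\"neuron\", {})\n",
    "\n",
    "# compilation settings written at export time, e.g. batch_size, sequence_length, num_cores\n",
    "print(neuron_config)\n",
    "```"
   ]
  },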
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.inf2.8xlarge` instance type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Your model is not compiled. Please compile your model before using Inferentia.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "------------------------!"
     ]
    }
   ],
   "source": [
    "# Deploy model to an endpoint\n",
    "# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy\n",
    "llm = llm_model.deploy(\n",
    "  initial_instance_count=1,\n",
    "  instance_type=instance_type,\n",
    "  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SageMaker will create our endpoint and deploy the model to it. This can takes a 10-15 minutes. "
   ]
  },
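  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If your notebook kernel restarts or you lose the `llm` object while the endpoint is still running, you do not need to redeploy. You can re-attach to the existing endpoint with a `HuggingFacePredictor`. This is a minimal sketch; the endpoint name below is a placeholder you would replace with the name shown in the SageMaker console:\n",
    "\n",
    "```python\n",
    "from sagemaker.huggingface import HuggingFacePredictor\n",
    "\n",
    "# re-attach to an already running endpoint (placeholder endpoint name)\n",
    "llm = HuggingFacePredictor(\n",
    "    endpoint_name=\"huggingface-pytorch-tgi-inference-2024-01-01-00-00-00-000\",\n",
    "    sagemaker_session=sess,\n",
    ")\n",
    "```"
   ]
  },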
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Run inference and chat with the model\n",
    "\n",
    "After our endpoint is deployed, we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can inference with different parameters to impact the generation. Parameters can be defined in the `parameters` attribute of the payload. You can find supported parameters in the [here](https://www.philschmid.de/sagemaker-llama-llm#5-run-inference-and-chat-with-the-model) or in the open API specification of the TGI in the [swagger documentation](https://huggingface.github.io/text-generation-inference/)\n",
    "\n",
    "The `HuggingFaceH4/zephyr-7b-beta` is a conversational chat model, meaning we can chat with it using the following prompt:\n",
    "  \n",
    "```\n",
    "<|system|>\\nYou are a friendly.</s>\\n<|user|>\\nInstruction</s>\\n<|assistant|>\\n\n",
    "```\n",
    "\n",
    "To avoid drafting the prompt, we can use the `apply_chat_template` method from the tokenizer, which expects a `messages` dictionary with the known OpenAI format and converts it into the correct format for the model. Let's see if Zephyr knows some facts about AWS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "# load the tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sure, here's an interesting fact about AWS: As of 2021, AWS has more than 200 services in its portfolio, ranging from compute power and storage to databases, analytics, and machine learning. This vast array of services allows developers and businesses to build and deploy complex applications and workflows with flexibility and agility, without having to manage the underlying infrastructure. In fact, AWS's extensive service offerings have contributed to its dominance in the cloud computing market, with a market share of over 30% as of 2021.</s>\n"
     ]
    }
   ],
   "source": [
    "# Prompt to generate\n",
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": \"You are the AWS expert\"},\n",
    "    {\"role\": \"user\", \"content\": \"Can you tell me an interesting fact abou AWS?\"},\n",
    "]\n",
    "prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "\n",
    "# Generation arguments\n",
    "payload = {\n",
    "    \"do_sample\": True,\n",
    "    \"top_p\": 0.6,\n",
    "    \"temperature\": 0.9,\n",
    "    \"top_k\": 50,\n",
    "    \"max_new_tokens\": 256,\n",
    "    \"repetition_penalty\": 1.03,\n",
    "    \"return_full_text\": False,\n",
    "    \"stop\": [\"</s>\"]\n",
    "}\n",
    "chat = llm.predict({\"inputs\":prompt, \"parameters\":payload})\n",
    "\n",
    "print(chat[0][\"generated_text\"][len(prompt):])\n",
    "# Sure, here's an interesting fact about AWS: As of 2021, AWS has more than 200 services in its portfolio, ranging from compute power and storage to databases,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awesome, we have successfully deployed Zephyr to Amazon SageMaker on Inferentia2 and chatted with it. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Clean up\n",
    "\n",
    "To clean up, we can delete the model and endpoint.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.delete_model()\n",
    "llm.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "hf",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5fcf248a74081676ead7e77f54b2c239ba2921b952f7cbcdbbe5427323165924"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
